Spatial Analysis of Travel Distances Across Different Modes Using Geographically Weighted Regression (GWR)

Goal

The goal is to understand how travel behavior (e.g., distances traveled for different modes of transportation) varies across different counties through a spatial regression analysis.

Data Sources

This vignette uses a database of 26,095 sample households residing in California, containing detailed information on their travel behavior on one assigned day for each household, from April 19, 2016 through April 25, 2017, provided by the Transportation Secure Data Center (TSDC). [1] Specifically, we obtained travel mode data from PersonData.Rds, HHData.Rdsand geographical data, i.e. the physical location and shape of each county, fromcounties.shp, files given in the dataset.

[1]“Transportation Secure Data Center.” (2019). National Renewable Energy Laboratory. Accessed Jan. 15, 2019: www.nrel.gov/tsdc.

Methodology

Geographically Weighted Regression (GWR):

Geographically Weighted Regression (GWR) is a spatial analysis technique that extends traditional regression by allowing the relationships between dependent and independent variables to vary spatially. Unlike ordinary least squares (OLS) regression, which assumes global stationarity of the coefficients, GWR incorporates geographic context into the model. This approach accounts for spatial heterogeneity, a common characteristic in spatial datasets, where relationships can change over space due to localized factors.

In GWR, the regression is performed repeatedly for each location in the dataset, weighting observations according to their spatial proximity to the focal location. The weighting is determined using a kernel function, which can be fixed or adaptive, depending on the data’s spatial distribution.[2]

Bandwidth Selection:

The selection of an appropriate bandwidth is a crucial step for GWR model. Bandwidth is a parameter that governs the spatial extent, over which neighboring observations influence the estimation of local parameters. The bandwidth serves as a key filter determining the degree of localization in the analysis. 

A bandwidth that is too narrow may lead to oversensitivity to local variations, potentially capturing noise in the data. On the other hand, too broad bandwidths can result in over smoothed representations , masking subtle spatial patterns. With a proper bandwidth value, we are able to achieve the balance to ensure the GWR model accurately captures the true spatial heterogeneity without being unduly influenced by distant observations. 

Adaptive bandwidths offer an effective solution, as they can vary based on the size of each geographical area and that of its neighbors. Thus, the model can select a narrower bandwidth in dense areas, and a larger one for suburban areas. [3]

In our analysis, we determined the optimal bandwidth using the `bw.gwr` function, which is 35.

[2] Charlton, M., & Fotheringham, A. S. (2009). Geographically weighted regression. [White Paper].

[3] Kiani et al.(2024, February 29). Mastering geographically weighted regression: Key considerations for building a robust model: Geospatial Health. Mastering geographically weighted regression: key considerations for building a robust model | Geospatial Health. https://www.geospatialhealth.net/gh/article/view/1271/1365

Data Pre-Processing

library(sf) 
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(sp)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GWmodel)
## Loading required package: robustbase
## Loading required package: Rcpp
## Welcome to GWmodel version 2.4-2.
library(ggplot2)
library(mapview)
library(leafpop)

person_data <- readRDS("./data/PersonData.Rds")
hh_data <- readRDS("./data/HHData.Rds")
bg_density <- readRDS("./data/hh_bgDensity.Rds")
shapefile <- st_read("./data/counties/counties.shp")
## Reading layer `counties' from data source 
##   `/Users/sophieshixue/Documents/vignette-geospatial-T_T9/Data/counties/counties.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.3926 ymin: 32.53578 xmax: -114.1312 ymax: 42.00952
## Geodetic CRS:  WGS 84
boundaries <- st_read("./data/ca_state_boundaries/CA_State.shp")
## Reading layer `CA_State' from data source 
##   `/Users/sophieshixue/Documents/vignette-geospatial-T_T9/Data/ca_state_boundaries/CA_State.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13857270 ymin: 3832931 xmax: -12705030 ymax: 5162406
## Projected CRS: Popular Visualisation CRS / Mercator
head(person_data)
## # A tibble: 6 × 50
##      hhid  pnum  Male   Age persHisp persWhite persAfricanAm persNativeAm
##     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>         <dbl>        <dbl>
## 1 1031985     1     1    74        0         1             0            0
## 2 1031985     2     0    73        0         1             0            0
## 3 1032036     1     1    46        0         1             0            0
## 4 1032036     2     0    47        0         1             0            0
## 5 1032036     3     1    15        0         1             0            0
## 6 1032036     4     1    14        0         1             0            0
## # ℹ 42 more variables: persAsian <dbl>, persPacIsl <dbl>, persOthr <dbl>,
## #   persDKrace <dbl>, persRFrace <dbl>, bornUSA <dbl>, DriverLic <dbl>,
## #   TransitPass <dbl>, Employed <dbl>, WorkFixedLoc <dbl>, WorkHome <dbl>,
## #   WorkNonfixed <dbl>, WorkDaysWk <dbl>, TypicalHoursWk <dbl>,
## #   FlexSched <dbl>, FlexPrograms <dbl>, Disability <dbl>, DisLicensePlt <dbl>,
## #   TransitTripsWk <dbl>, WalkTripsWk <dbl>, BikeTripsWk <dbl>, Student <dbl>,
## #   WorkMode <chr>, SchoolMode <chr>, EducationCompl <chr>, workday <dbl>, …
head(hh_data)
## # A tibble: 6 × 19
##      hhid CTFIP County        MPO    City  DOW   HH_size HH_nTrips HH_nEmployees
##     <dbl> <dbl> <chr>         <chr>  <chr> <chr>   <dbl>     <dbl>         <dbl>
## 1 1031985  6095 Solano        MTC    Vall… Tues…       2         4             0
## 2 1032036  6073 San Diego     SANDAG San … Satu…       5        31             1
## 3 1032053  6047 Merced        Merced Merc… Thur…       6        46             1
## 4 1032425  6083 Santa Barbara Santa… Gole… Mond…       2         0             2
## 5 1032558  6037 Los Angeles   SCAG   Los … Frid…       1         6             0
## 6 1033586  6061 Placer        SACOG  Linc… Frid…       3        10             1
## # ℹ 10 more variables: HH_nStudents <dbl>, HH_nLicenses <dbl>, HH_nCars <dbl>,
## #   HH_nBikes <dbl>, HH_income <dbl>, HH_anyTransitRider <dbl>,
## #   HH_homeType <dbl>, HH_homeowner <dbl>, HH_isHispanic <dbl>,
## #   HH_intEnglish <dbl>
head(bg_density)
## # A tibble: 6 × 3
##      hhid bg_density bg_group
##     <int>      <dbl> <fct>   
## 1 1449245       818. Suburban
## 2 3007304       818. Suburban
## 3 3008384       818. Suburban
## 4 3007273       818. Suburban
## 5 1452535       818. Suburban
## 6 3007437      2843. Urban
head(shapefile)
## Simple feature collection with 6 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -123.5363 ymin: 34.89747 xmax: -117.9808 ymax: 38.85292
## Geodetic CRS:  WGS 84
##   STATEFP COUNTYFP COUNTYNS CTFIP            NAME               NAMELSAD LSAD
## 1      06      107 00277318  6107          Tulare          Tulare County   06
## 2      06      009 01675885  6009       Calaveras       Calaveras County   06
## 3      06      047 00277288  6047          Merced          Merced County   06
## 4      06      079 00277304  6079 San Luis Obispo San Luis Obispo County   06
## 5      06      097 01657246  6097          Sonoma          Sonoma County   06
## 6      06      041 00277285  6041           Marin           Marin County   06
##   CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT       ALAND    AWATER
## 1      H1 G4020  <NA>  47300     <NA>        A 12494707314  37391604
## 2      H1 G4020  <NA>   <NA>     <NA>        A  2641820029  43810423
## 3      H1 G4020  <NA>  32900     <NA>        A  5011554680 112760479
## 4      H1 G4020  <NA>  42020     <NA>        A  8543230300 820974619
## 5      H1 G4020   488  42220     <NA>        A  4081430061 497530414
## 6      H1 G4020   488  41860    41884        A  1347585499 797420416
##      INTPTLAT     INTPTLON                       geometry
## 1 +36.2288317 -118.7810618 MULTIPOLYGON (((-118.3606 3...
## 2 +38.1846184 -120.5593996 MULTIPOLYGON (((-120.02 38....
## 3 +37.1948063 -120.7228019 MULTIPOLYGON (((-120.0521 3...
## 4 +35.3852268 -120.4475409 MULTIPOLYGON (((-120.214 35...
## 5 +38.5250258 -122.9376050 MULTIPOLYGON (((-122.513 38...
## 6 +38.0518169 -122.7459738 MULTIPOLYGON (((-123.0233 3...
# Merge Data
combine_data <- left_join(person_data, hh_data) %>% left_join(bg_density)
## Joining with `by = join_by(hhid)`
## Joining with `by = join_by(hhid)`
head(combine_data)
## # A tibble: 6 × 70
##      hhid  pnum  Male   Age persHisp persWhite persAfricanAm persNativeAm
##     <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>         <dbl>        <dbl>
## 1 1031985     1     1    74        0         1             0            0
## 2 1031985     2     0    73        0         1             0            0
## 3 1032036     1     1    46        0         1             0            0
## 4 1032036     2     0    47        0         1             0            0
## 5 1032036     3     1    15        0         1             0            0
## 6 1032036     4     1    14        0         1             0            0
## # ℹ 62 more variables: persAsian <dbl>, persPacIsl <dbl>, persOthr <dbl>,
## #   persDKrace <dbl>, persRFrace <dbl>, bornUSA <dbl>, DriverLic <dbl>,
## #   TransitPass <dbl>, Employed <dbl>, WorkFixedLoc <dbl>, WorkHome <dbl>,
## #   WorkNonfixed <dbl>, WorkDaysWk <dbl>, TypicalHoursWk <dbl>,
## #   FlexSched <dbl>, FlexPrograms <dbl>, Disability <dbl>, DisLicensePlt <dbl>,
## #   TransitTripsWk <dbl>, WalkTripsWk <dbl>, BikeTripsWk <dbl>, Student <dbl>,
## #   WorkMode <chr>, SchoolMode <chr>, EducationCompl <chr>, workday <dbl>, …

Group the data by county (CTFIP), and calculates the average total number of miles traveled by a person for each travel mode (Drive Alone, Drive with Others, Passenger, Walk, and Total) on the survey day.

summarized_data <- combine_data %>%
  select(hhid, pnum, DriveAlone_Dist, Driveothers_Dist, Passenger_Dist, Walk_Dist, Bike_Dist, CTFIP, Sum_PMT) %>%
  group_by(CTFIP) %>%
  summarise(
    avg_DriveAlone_Dist = mean(DriveAlone_Dist, na.rm = T),
    avg_Driveothers_Dist = mean(Driveothers_Dist, na.rm = T),
    avg_Passenger_Dist = mean(Passenger_Dist, na.rm = T),
    avg_Walk_Dist = mean(Walk_Dist, na.rm = T),
    avg_Bike_Dist = mean(Bike_Dist, na.rm = T),
    avg_Sum_Pmt = mean(Sum_PMT, na.rm = T))

Plot

shapefile_sum <- shapefile %>%
  left_join(summarized_data, by = "CTFIP")

mapviewOptions(fgb = F)

mapview(shapefile_sum, 
        zcol = "avg_Sum_Pmt", # assigned color based on sum distance
        legend = TRUE,
        label = as.character(shapefile_sum$CTFIP),  
        popup = popupTable(shapefile_sum, 
                           zcol = c("avg_DriveAlone_Dist", 
                                    "avg_Driveothers_Dist", 
                                    "avg_Passenger_Dist", 
                                    "avg_Walk_Dist", 
                                    "avg_Bike_Dist",
                                    "avg_Sum_Pmt")))

Select GWR Bandwidth

Use bw.gwr() to find the optimal bandwidth for GWR analysis.

coords <- st_coordinates(st_centroid(st_geometry(shapefile_sum)))

gwr_data <- shapefile_sum %>%
  select(avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist, avg_Walk_Dist, avg_Bike_Dist, avg_Sum_Pmt) %>%
  st_drop_geometry()

gwr_data <- cbind(gwr_data, coords)

gwr_formula <- avg_Sum_Pmt ~ avg_DriveAlone_Dist + avg_Driveothers_Dist + avg_Passenger_Dist + avg_Walk_Dist + avg_Bike_Dist

# convert to spatial data
coordinates(gwr_data) <- ~X + Y
proj4string(gwr_data) <- CRS("+proj=longlat +datum=WGS84")

# Perform bandwidth selection for GWR
bw <- bw.gwr(
  formula = gwr_formula,
  data = gwr_data,  
  adaptive = T)
## Adaptive bandwidth: 43 CV score: 36.66803 
## Adaptive bandwidth: 35 CV score: 36.27212 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 37 CV score: 37.88228 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 34 CV score: 36.77004 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 32 CV score: 38.86713 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 30 CV score: 40.67171 
## Adaptive bandwidth: 30 CV score: 40.67171 
## Adaptive bandwidth: 29 CV score: 41.34116 
## Adaptive bandwidth: 29 CV score: 41.34116 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 27 CV score: 40.22093 
## Adaptive bandwidth: 27 CV score: 40.22093 
## Adaptive bandwidth: 26 CV score: 40.56296 
## Adaptive bandwidth: 26 CV score: 40.56296 
## Adaptive bandwidth: 25 CV score: 40.20452 
## Adaptive bandwidth: 25 CV score: 40.20452 
## Adaptive bandwidth: 24 CV score: 39.88842 
## Adaptive bandwidth: 24 CV score: 39.88842 
## Adaptive bandwidth: 23 CV score: 40.12316 
## Adaptive bandwidth: 23 CV score: 40.12316 
## Adaptive bandwidth: 22 CV score: 40.45429 
## Adaptive bandwidth: 22 CV score: 40.45429

Perform Geographically Weighted Regression

gwr_model <- gwr.basic(
  formula = gwr_formula,
  data = gwr_data,
  bw = bw, 
  adaptive = T)

gwr_result <- gwr_model$SDF
summary(gwr_result)
## Object of class SpatialPointsDataFrame
## Coordinates:
##          min        max
## X -123.89432 -115.36552
## Y   33.03547   41.74308
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Number of points: 58
## Data attributes:
##    Intercept       avg_DriveAlone_Dist avg_Driveothers_Dist avg_Passenger_Dist
##  Min.   :-0.4934   Min.   :0.9145      Min.   :0.8048       Min.   :0.8516    
##  1st Qu.: 0.4961   1st Qu.:0.9518      1st Qu.:0.9750       1st Qu.:0.8755    
##  Median : 0.8308   Median :0.9979      Median :1.0536       Median :0.9178    
##  Mean   : 0.9485   Mean   :1.0118      Mean   :1.0237       Mean   :0.9157    
##  3rd Qu.: 1.4073   3rd Qu.:1.0629      3rd Qu.:1.0721       3rd Qu.:0.9444    
##  Max.   : 2.5491   Max.   :1.1466      Max.   :1.1642       Max.   :1.0113    
##  avg_Walk_Dist     avg_Bike_Dist         y              yhat      
##  Min.   :-0.9423   Min.   :1.616   Min.   :18.13   Min.   :19.12  
##  1st Qu.: 1.6040   1st Qu.:2.339   1st Qu.:25.74   1st Qu.:25.73  
##  Median : 4.0746   Median :2.797   Median :28.15   Median :28.10  
##  Mean   : 3.5652   Mean   :2.692   Mean   :28.95   Mean   :28.98  
##  3rd Qu.: 5.6299   3rd Qu.:3.087   3rd Qu.:31.11   3rd Qu.:31.31  
##  Max.   : 6.9179   Max.   :3.916   Max.   :40.88   Max.   :40.47  
##     residual            CV_Score Stud_residual       Intercept_SE   
##  Min.   :-1.035504   Min.   :0   Min.   :-2.66169   Min.   :0.6972  
##  1st Qu.:-0.321663   1st Qu.:0   1st Qu.:-0.75237   1st Qu.:0.8132  
##  Median : 0.005585   Median :0   Median : 0.01299   Median :0.9091  
##  Mean   :-0.029618   Mean   :0   Mean   :-0.11165   Mean   :0.9201  
##  3rd Qu.: 0.244794   3rd Qu.:0   3rd Qu.: 0.49366   3rd Qu.:1.0281  
##  Max.   : 1.141008   Max.   :0   Max.   : 3.06768   Max.   :1.2451  
##  avg_DriveAlone_Dist_SE avg_Driveothers_Dist_SE avg_Passenger_Dist_SE
##  Min.   :0.04545        Min.   :0.07797         Min.   :0.07302      
##  1st Qu.:0.04917        1st Qu.:0.08661         1st Qu.:0.08168      
##  Median :0.05427        Median :0.09965         Median :0.09066      
##  Mean   :0.05475        Mean   :0.09926         Mean   :0.09005      
##  3rd Qu.:0.05991        3rd Qu.:0.10985         3rd Qu.:0.09689      
##  Max.   :0.07354        Max.   :0.12553         Max.   :0.11771      
##  avg_Walk_Dist_SE avg_Bike_Dist_SE  Intercept_TV     avg_DriveAlone_Dist_TV
##  Min.   :0.8861   Min.   :1.050    Min.   :-0.6075   Min.   :12.72         
##  1st Qu.:1.0182   1st Qu.:1.122    1st Qu.: 0.5059   1st Qu.:17.83         
##  Median :1.1827   Median :1.170    Median : 0.9482   Median :19.25         
##  Mean   :1.4982   Mean   :1.234    Mean   : 0.9975   Mean   :18.72         
##  3rd Qu.:1.5748   3rd Qu.:1.337    3rd Qu.: 1.6638   3rd Qu.:20.28         
##  Max.   :3.3180   Max.   :1.605    Max.   : 2.6042   Max.   :21.74         
##  avg_Driveothers_Dist_TV avg_Passenger_Dist_TV avg_Walk_Dist_TV 
##  Min.   : 6.411          Min.   : 7.903        Min.   :-0.6299  
##  1st Qu.: 9.098          1st Qu.: 9.168        1st Qu.: 1.0900  
##  Median :10.650          Median :10.414        Median : 2.0566  
##  Mean   :10.565          Mean   :10.306        Mean   : 2.9079  
##  3rd Qu.:12.364          3rd Qu.:11.350        3rd Qu.: 5.5485  
##  Max.   :13.466          Max.   :12.864        Max.   : 6.4995  
##  avg_Bike_Dist_TV    Local_R2     
##  Min.   :1.107    Min.   :0.9879  
##  1st Qu.:1.860    1st Qu.:0.9922  
##  Median :2.414    Median :0.9937  
##  Mean   :2.236    Mean   :0.9937  
##  3rd Qu.:2.723    3rd Qu.:0.9952  
##  Max.   :3.160    Max.   :0.9978
head(gwr_result)
## class       : SpatialPointsDataFrame 
## features    : 6 
## extent      : -122.8884, -118.8004, 35.38639, 38.52794  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 24
## names       :          Intercept, avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist,    avg_Walk_Dist,    avg_Bike_Dist,                y,             yhat,           residual, CV_Score,     Stud_residual,      Intercept_SE, avg_DriveAlone_Dist_SE, avg_Driveothers_Dist_SE, avg_Passenger_Dist_SE, ... 
## min values  : -0.493407547944277,   0.931195113605537,    0.804772664961889,  0.882009855991313, 2.07589095658407, 1.93713278106022, 25.5660966027259, 26.1724569854569, -0.780451312709275,        0, -2.66168916077019, 0.812249140952043,     0.0454497445608674,      0.0779715804092963,     0.083079988605325, ... 
## max values  :    2.1567630022245,    1.14663162009085,     1.07245452213881,    1.0113345223335, 6.82075707909364, 3.53063802804537, 38.0691114580212, 38.8495627707305,  0.231925827521426,        0,  0.45129862727914,   1.1574966922564,     0.0668625157232842,       0.125534232867479,     0.110703232342877, ...
gwr_sf <- st_as_sf(gwr_result)
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_DriveAlone_Dist)) + 
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_DriveAlone_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Driveothers_Dist)) +  
  scale_color_viridis_c() +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Driveothers_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Passenger_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() + 
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Passenger_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Walk_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Walk_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Bike_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Bike_Dist Coefficient", 
       color = "Coefficient Value")

interpret coefficient xxxx